Abstract

Under a "positive curvature" assumption expressing a kind of metric
ergodicity, we provide explicit non-asymptotic estimates for the rate of
convergence of empirical means of Markov chains, together with a Gaussian or
exponential control on the deviations of empirical means.

The goal of the Markov chain Monte Carlo method is to provide an efficient way
to approximate the integral $\pi(f) := \int f(x)\,\pi(dx)$ of a function $f$ under
a finite measure $\pi$ on some space $\mathcal{X}$. This approach, which has been
very successful, consists in constructing a hopefully easy-to-simulate Markov
chain $(X_1, X_2, \ldots, X_k, \ldots)$ on $\mathcal{X}$ with stationary distribution
$\pi$, waiting for a time $T_0$ (the burn-in) so that the chain gets close to its
stationary distribution, and then estimating $\pi(f)$ by the empirical mean on the
next $T$ steps of the trajectory, with $T$ large enough:

$$\hat\pi(f) := \frac{1}{T} \sum_{k=T_0+1}^{T_0+T} f(X_k).$$

We refer e.g. to [RR04] for a review of the topic.
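As an illustration, the estimator described above (discard $T_0$ steps, then average $f$ over the next $T$ steps) can be sketched as follows. This is a minimal sketch, not code from the paper: the function name `empirical_mean` and the parameters `step`, `burn_in` and `n_samples` are hypothetical, and `step` stands for any user-supplied routine that draws $X_{k+1}$ given $X_k$.

```python
from typing import Callable, TypeVar

State = TypeVar("State")


def empirical_mean(
    step: Callable[[State], State],
    f: Callable[[State], float],
    x0: State,
    burn_in: int,
    n_samples: int,
) -> float:
    """Average f over X_{T0+1}, ..., X_{T0+T} for a chain started at X_0 = x0.

    burn_in plays the role of T0 and n_samples the role of T (hypothetical names).
    """
    if n_samples < 1 or burn_in < 0:
        raise ValueError("need n_samples >= 1 and burn_in >= 0")
    x = x0
    for _ in range(burn_in):      # discard X_1, ..., X_{T0}
        x = step(x)
    total = 0.0
    for _ in range(n_samples):    # accumulate f(X_{T0+1}), ..., f(X_{T0+T})
        x = step(x)
        total += f(x)
    return total / n_samples
```

Passing the transition as a function keeps the sketch independent of any particular chain; a random `step` would typically close over its own random number generator.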

Under suitable assumptions [MT93], it is known that $\hat\pi(f)$ almost surely
tends to $\pi(f)$ as $T \to \infty$, that the variance of $\hat\pi(f)$ decreases
asymptotically like $1/T$ and that a central limit theorem holds for the errors
$\hat\pi(f) - \pi(f)$. Unfortunately, these theorems are asymptotic only, and thus
mainly of theoretical interest since they do not provide explicit confidence
intervals for $\pi(f)$ at a given time $T$. Some even say that confidence
intervals disappeared the day MCMC methods appeared.

In this paper, we aim at establishing rigorous non-asymptotic upper bounds for
the error $|\hat\pi(f) - \pi(f)|$, which will provide good deviation estimates and
confidence intervals for $\pi(f)$. An important point is that we will try to
express all results in terms of explicit quantities that are readily computable
given a choice of a Markov chain; and, at the same time, recover correct orders
of magnitude in a surprising variety of examples.

Our non-asymptotic estimates have the same qualitative behavior as theory
predicts in the asymptotic regime: the variance of $\hat\pi(f)$ decreases like
$1/T$, and the bias decreases exponentially in $T_0$. Moreover, we provide a
Gaussian or exponential control on deviations of $\hat\pi(f)$, which allows for
good confidence intervals. Finally, we find that the influence of the choice of
the starting point on the variance of $\hat\pi(f)$ decreases like $1/T^2$.

Our results hold under an assumption of positive curvature [Oll07, Oll09], which
can be understood as a kind of "metric ergodicity". Not all Markov chains satisfy
this assumption, but important examples include spin systems at high
temperature, several of the usual types of waiting queues, processes such as the
Ornstein-Uhlenbeck process on $\mathbb{R}^d$ or Brownian motion on positively
curved manifolds. We refer to [Oll09] for more examples and discussions on how
one can check this assumption, but let us stress that, at least in principle,
this curvature can be computed explicitly given a Markov transition kernel. This
property or similar ones can be traced back to Dobrushin [Dob70, DS85], and have
been used a few times in the Markov chain literature [CW94, Dob96, BD97, DGW04,
Jou07, Oll07, Oll09, Jou, Oli].

Similar concentration inequalities have been recently investigated in [Jou] for
time-continuous Markov jump processes. More precisely, the first author obtained
Poisson-type tail estimates for Markov processes with positive Wasserstein
curvature. Actually, the latter is nothing but a continuous-time version of the
Ricci curvature emphasized in the present paper, so that we expect to recover
such results by a simple limiting argument (cf. Section 2).

Estimates for the deviations of empirical means have previously been given
in [Lez98] using the spectral gap of the Markov chain, under different conditions
(namely that the chain is reversible, that the law of the initial point has a
density w.r.t. $\pi$, and that $f$ is bounded). The positive curvature assumption,
which is a stronger property than the spectral gap used by Lezaud, allows us to
lift these restrictions: our results apply to an arbitrary starting point, the
function $f$ only has to be Lipschitz, and reversibility plays no particular
role. In a series of papers (for instance [Wu00, CG08, GLWY]), the spectral
approach has been extended into a general framework for deviations of empirical
means using various types of functional inequalities; in particular [GLWY]
contains a very nice characterization of asymptotic variance of empirical means
of Lipschitz functions in terms of a functional inequality $W_1 I$ satisfied by
the invariant distribution.

1 Preliminaries and statement of the results 
1.1 Notation 

Markov chains. In this paper, we consider a Markov chain $(X_N)_{N\in\mathbb{N}}$
in a Polish (i.e. metric, complete, separable) state space $(\mathcal{X}, d)$. The
associated transition kernel is denoted $(P_x)_{x\in\mathcal{X}}$ where each $P_x$
is a probability measure on $\mathcal{X}$, so that $P_x(dy)$ is the transition
probability from $x$ to $y$. The $N$-step transition kernel is defined
inductively as

$$P_x^{N+1}(dy) := \int_{\mathcal{X}} P_x^{N}(dz)\, P_z(dy)$$

(with $P_x^1 := P_x$). The distribution at time $N$ of the Markov chain given the
initial probability measure $\mu$ is the measure $\mu P^N$ given by

$$\mu P^N(dy) = \int_{\mathcal{X}} P_x^N(dy)\, \mu(dx).$$

Let as usual $\mathbb{E}_x$ denote the expectation of a random variable knowing
that the initial point of the Markov chain is $x$. For any measurable function
$f : \mathcal{X} \to \mathbb{R}$, define the iterated averaging operator as

$$P^N f(x) := \mathbb{E}_x f(X_N) = \int_{\mathcal{X}} f(y)\, P_x^N(dy), \qquad x \in \mathcal{X}.$$

A probability measure $\pi$ on $\mathcal{X}$ is said to be invariant for the chain
if $\pi = \pi P$. Under suitable assumptions on the Markov chain
$(X_N)_{N\in\mathbb{N}}$, such an invariant measure $\pi$ exists and is unique, as
we will see below.

Denote by $\mathcal{P}_d(\mathcal{X})$ the set of those probability measures $\mu$
on $\mathcal{X}$ such that $\int_{\mathcal{X}} d(y, x_0)\, \mu(dy) < \infty$ for some
(or equivalently for all) $x_0 \in \mathcal{X}$. We will always assume that the map
$x \mapsto P_x$ is measurable, and that $P_x \in \mathcal{P}_d(\mathcal{X})$ for every
$x \in \mathcal{X}$. These assumptions are always satisfied in practice.

Wasserstein distance. The transportation distance, or Wasserstein distance,
between two probability measures $\mu_1, \mu_2 \in \mathcal{P}_d(\mathcal{X})$
represents the "best" way to send $\mu_1$ on $\mu_2$ so that on average, points
are moved by the smallest possible distance. It is defined [Vil03] as

$$W_1(\mu_1, \mu_2) := \inf_{\xi \in \Pi(\mu_1, \mu_2)} \int_{\mathcal{X}} \int_{\mathcal{X}} d(x,y)\, \xi(dx, dy),$$

where $\Pi(\mu_1, \mu_2)$ is the set of probability measures
$\xi \in \mathcal{P}_d(\mathcal{X} \times \mathcal{X})$ with marginals $\mu_1$ and
$\mu_2$, i.e. such that $\int_y \xi(dx, dy) = \mu_1(dx)$ and
$\int_x \xi(dx, dy) = \mu_2(dy)$. (So intuitively $\xi(dx, dy)$ represents the
amount of mass travelling from $x$ to $y$.)

Ricci curvature of a Markov chain. Our main assumption in this paper is the
following, which can be seen geometrically as a "positive Ricci curvature" [Oll09]
property of the Markov chain.

Standing assumption.

There exists $\kappa > 0$ such that

$$W_1(P_x, P_y) \le (1 - \kappa)\, d(x, y)$$

for any $x, y \in \mathcal{X}$.

In practice, it is not necessary to compute the exact value of the Wasserstein
distance $W_1(P_x, P_y)$: it is enough to exhibit one choice of $\xi(dx, dy)$
providing a good value of $W_1(P_x, P_y)$.

An important remark is that on a "geodesic" space $\mathcal{X}$, it is sufficient
to control $W_1(P_x, P_y)$ only for nearby points $x, y \in \mathcal{X}$, and not
for all pairs of points (Proposition 19 in [Oll09]). For instance, on a graph it
is enough to check the assumption on pairs of neighbors.

These remarks make the assumption possible to check in practice, as we will
see from the examples below.
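To make the two remarks concrete, here is a minimal sketch for the bit-flip chain on $\{0,1\}^N$ with the Hamming distance that is treated among the examples of Section 2 (pick a position uniformly, resample that bit as 0 or 1 with probability 1/2). The coupling used, namely the same position and the same new bit for both chains, is one illustrative choice of $\xi$ and is an assumption of this sketch; the function names are hypothetical. Since $W_1(P_x, P_y)$ is at most the expected distance under any coupling, the code yields a lower bound on $\kappa$ from a single pair of neighbors.

```python
from fractions import Fraction
from itertools import product


def hamming(x, y):
    """Number of positions at which the bit tuples x and y differ."""
    return sum(a != b for a, b in zip(x, y))


def coupled_expected_distance(x, y):
    """Exact E d(X', Y') when both chains use the same position and the same new bit."""
    n = len(x)
    total = Fraction(0)
    for i, bit in product(range(n), (0, 1)):
        x_new = x[:i] + (bit,) + x[i + 1:]
        y_new = y[:i] + (bit,) + y[i + 1:]
        total += Fraction(1, 2 * n) * hamming(x_new, y_new)
    return total


def kappa_lower_bound(x, y):
    """W1(P_x, P_y) <= E d(X', Y'), hence kappa >= 1 - E d(X', Y') / d(x, y) for this pair."""
    return 1 - coupled_expected_distance(x, y) / hamming(x, y)
```

For two neighboring sequences of length 4, such as `(0, 0, 0, 0)` and `(1, 0, 0, 0)`, the bound equals `Fraction(1, 4)`, in line with the value $\kappa = 1/N$ quoted later for this chain.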

More notation: eccentricity, diffusion constant, local dimension, granularity.
Under the assumption above, Corollary 21 in [Oll09] entails the existence of a
unique invariant measure $\pi \in \mathcal{P}_d(\mathcal{X})$, with moreover the
following geometric ergodicity in $W_1$-distance (instead of the classical total
variation distance, which is obtained by choosing the trivial metric
$d(x, y) = \mathbf{1}_{x \neq y}$):

$$W_1(\mu P^N, \pi) \le (1 - \kappa)^N\, W_1(\mu, \pi), \tag{1}$$

and in particular

$$W_1(P_x^N, \pi) \le (1 - \kappa)^N\, E(x), \tag{2}$$

where the eccentricity $E$ at point $x \in \mathcal{X}$ is defined as

$$E(x) := \int_{\mathcal{X}} d(x, y)\, \pi(dy).$$

Note that eccentricity satisfies the bounds [Oll09]:

$$E(x) \le \begin{cases} \operatorname{diam} \mathcal{X}; \\ E(x_0) + d(x, x_0), & x_0 \in \mathcal{X}; \\ \dfrac{1}{\kappa} \displaystyle\int_{\mathcal{X}} d(x, y)\, P_x(dy). \end{cases}$$

Such a priori estimates are useful in various settings. In particular, the last
one is "local" in the sense that it is easily computable given the Markov kernel
$P_x$.

Let us also introduce the coarse diffusion constant $\sigma(x)$ of the Markov
chain at a point $x \in \mathcal{X}$, which controls the size of the steps,
defined by

$$\sigma(x)^2 := \frac{1}{2} \iint d(y, z)^2\, P_x(dy)\, P_x(dz).$$

Let the local dimension $n_x$ at point $x \in \mathcal{X}$ be given by

$$n_x := \inf_{f\ 1\text{-Lipschitz}} \frac{\iint d(y, z)^2\, P_x(dy)\, P_x(dz)}{\iint |f(y) - f(z)|^2\, P_x(dy)\, P_x(dz)} \ge 1.$$

Let the granularity of the Markov chain be

$$\sigma_\infty := \frac{1}{2} \sup_{x \in \mathcal{X}} \operatorname{diam} \operatorname{Supp} P_x,$$

which we will often assume to be finite.

For example, for the simple random walk in a graph we have $\sigma_\infty \le 1$
and $\sigma(x)^2 \le 2$.
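For a kernel with finite support these two quantities are direct sums. The sketch below assumes the definitions $\sigma(x)^2 = \frac12 \iint d(y,z)^2 P_x(dy) P_x(dz)$ and $\sigma_\infty = \frac12 \sup_x \operatorname{diam} \operatorname{Supp} P_x$, with $P_x$ passed as a dictionary mapping states to probabilities; the function names and the cycle used for illustration are hypothetical choices, not taken from the text.

```python
def coarse_diffusion_sq(kernel_x, dist):
    """sigma(x)^2 = 1/2 * sum over y, z of d(y, z)^2 P_x(y) P_x(z), for finite support."""
    return 0.5 * sum(
        py * pz * dist(y, z) ** 2
        for y, py in kernel_x.items()
        for z, pz in kernel_x.items()
    )


def half_support_diameter(kernel_x, dist):
    """1/2 * diam Supp P_x; the granularity is the supremum of this quantity over x."""
    support = [y for y, p in kernel_x.items() if p > 0]
    return 0.5 * max(dist(y, z) for y in support for z in support)


def cycle_distance(n):
    """Graph distance on the cycle Z/nZ (illustrative state space)."""
    return lambda a, b: min((a - b) % n, (b - a) % n)
```

For the simple random walk on a cycle of length 8 started at 0, the kernel is `{1: 0.5, 7: 0.5}`; both functions then return 1.0, consistent with the bounds $\sigma_\infty \le 1$ and $\sigma(x)^2 \le 2$ stated for random walks on graphs.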

Finally, we will denote by $\|\cdot\|_{\mathrm{Lip}}$ the usual Lipschitz seminorm
of a function $f$ on $\mathcal{X}$:

$$\|f\|_{\mathrm{Lip}} := \sup_{x \neq y} \frac{|f(x) - f(y)|}{d(x, y)}.$$

1.2 Results 

Back to the introduction, choosing integers $T \ge 1$ and $T_0 \ge 0$ and setting

$$\hat\pi(f) := \frac{1}{T} \sum_{k=T_0+1}^{T_0+T} f(X_k),$$

the purpose of this paper is to understand how fast the difference
$|\hat\pi(f) - \pi(f)|$ goes to 0 as $T$ goes to infinity, for a large class of
functions $f$. Namely, we will consider Lipschitz functions (recall that
Corollary 21 in [Oll09] implies that all Lipschitz functions are
$\pi$-integrable).

Bias and non-asymptotic variance. Our first interest is in the non-asymptotic
mean quadratic error

$$\mathbb{E}_x\left[ |\hat\pi(f) - \pi(f)|^2 \right]$$

given any starting point $x \in \mathcal{X}$ for the Markov chain.

There are two contributions to this error: a variance part, controlling how
$\hat\pi(f)$ differs between two independent runs both starting at $x$, and a bias
part, which is the difference between $\pi(f)$ and the average value of
$\hat\pi(f)$ starting at $x$. Namely, the mean quadratic error decomposes as the
sum of the squared bias plus the variance:

$$\mathbb{E}_x\left[ |\hat\pi(f) - \pi(f)|^2 \right] = \left| \mathbb{E}_x \hat\pi(f) - \pi(f) \right|^2 + \operatorname{Var}_x \hat\pi(f) \tag{3}$$

where $\operatorname{Var}_x \hat\pi(f) := \mathbb{E}_x\left[ |\hat\pi(f) - \mathbb{E}_x \hat\pi(f)|^2 \right]$.

As we will see, these two terms have different behaviors depending on $T_0$ and
$T$. For instance, the bias is expected to decrease exponentially fast as the
burn-in period $T_0$ becomes large, whereas if $T$ is fixed, the variance term
does not vanish as $T_0 \to \infty$.

Let us start with control of the bias term, which depends, of course, on the
starting point of the Markov chain. All proofs are postponed to Section 3.

Proposition 1 (Bias of empirical means). 

For any Lipschitz function $f : \mathcal{X} \to \mathbb{R}$, we have the upper
bound on the bias:

$$\left| \mathbb{E}_x \hat\pi(f) - \pi(f) \right| \le \frac{(1 - \kappa)^{T_0 + 1}}{\kappa T}\, E(x)\, \|f\|_{\mathrm{Lip}}. \tag{4}$$

The variance term is more delicate to control. For comparison, let us first
mention that under the invariant measure $\pi$, the variance of a Lipschitz
function $f$ is bounded as follows (and this estimate is often sharp):

$$\operatorname{Var}_\pi f \le \|f\|_{\mathrm{Lip}}^2 \sup_{x \in \mathcal{X}} \frac{\sigma(x)^2}{n_x \kappa} \tag{5}$$

(cf. Lemma 9 below or Proposition 32 in [Oll09]). This implies that, were one
able to sample from the invariant distribution $\pi$, the ordinary Monte Carlo
method of estimating $\pi(f)$ by the average over $T$ independent samples would
yield a variance bounded by
$\frac{\|f\|_{\mathrm{Lip}}^2}{T} \sup_{x \in \mathcal{X}} \frac{\sigma(x)^2}{n_x \kappa}$.
Because of correlations, this does not hold for the MCMC method. Nevertheless
we get the following.

Theorem 2 (Variance of empirical means, 1). 

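Provided the inequalities make sense, we have

$$\operatorname{Var}_x \hat\pi(f) \le \begin{cases} \dfrac{\|f\|_{\mathrm{Lip}}^2}{\kappa T}\, \sup_{x \in \mathcal{X}} \dfrac{\sigma(x)^2}{n_x \kappa} & \text{if } T_0 = 0; \\[2ex] \dfrac{\|f\|_{\mathrm{Lip}}^2}{\kappa T} \left( 1 + \dfrac{1}{\kappa T} \right) \sup_{x \in \mathcal{X}} \dfrac{\sigma(x)^2}{n_x \kappa} & \text{otherwise.} \end{cases} \tag{6}$$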



The most important feature of this formula is the $1/\kappa T$ factor, which
means there is an additional $1/\kappa$ factor with respect to the ordinary Monte
Carlo case. Intuitively, the idea is that correlations disappear after roughly
$1/\kappa$ steps, and so $T$ steps of the MCMC method are "worth" only $\kappa T$
independent samples. This $1/\kappa T$ factor will appear repeatedly in our text.
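The bias and variance bounds are simple enough to evaluate directly. The sketch below assumes that Proposition 1 reads $|\mathbb{E}_x \hat\pi(f) - \pi(f)| \le \frac{(1-\kappa)^{T_0+1}}{\kappa T} E(x) \|f\|_{\mathrm{Lip}}$ and that Theorem 2 reads $\operatorname{Var}_x \hat\pi(f) \le \frac{\|f\|_{\mathrm{Lip}}^2}{\kappa T} \sup_x \frac{\sigma(x)^2}{n_x \kappa}$ for $T_0 = 0$, with an extra factor $(1 + 1/\kappa T)$ otherwise. The function and parameter names are hypothetical; `sup_ratio` stands for $\sup_x \sigma(x)^2 / (n_x \kappa)$ and must be supplied by the caller.

```python
def bias_bound(kappa, T, T0, eccentricity, lip=1.0):
    """(1 - kappa)^(T0 + 1) / (kappa * T) * E(x) * ||f||_Lip."""
    return (1.0 - kappa) ** (T0 + 1) / (kappa * T) * eccentricity * lip


def variance_bound(kappa, T, T0, sup_ratio, lip=1.0):
    """||f||_Lip^2 / (kappa * T) * sup_ratio, times (1 + 1 / (kappa * T)) when T0 > 0."""
    base = lip ** 2 / (kappa * T) * sup_ratio
    return base if T0 == 0 else base * (1.0 + 1.0 / (kappa * T))
```

With $\kappa = 0.1$, $T = 100$ and `sup_ratio = 5.0`, the variance bound is 0.5 without burn-in and 0.55 with burn-in; the product $\kappa T = 10$ plays the role of an effective number of independent samples.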

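To get convinced that this $1/\kappa T$ factor in our estimate (6) is natural,
observe that if the burn-in $T_0$ is large enough, then the law of $X_{T_0}$ will
be close to the invariant distribution $\pi$ so that
$\operatorname{Var}_x \hat\pi(f)$ will behave like
$\operatorname{Var}_{X_0 \sim \pi} \hat\pi(f)$. Then we have

$$\operatorname{Var}_{X_0 \sim \pi} \hat\pi(f) = \frac{1}{T^2} \left( \sum_{i=1}^{T} \operatorname{Var}_{X_0 \sim \pi}(f(X_i)) + 2 \sum_{1 \le i < j \le T} \operatorname{Cov}_{X_0 \sim \pi}(f(X_i), f(X_j)) \right)$$

but our assumption on $\kappa$ easily implies that correlations decrease
exponentially fast with rate $1 - \kappa$ so that (at least in the reversible
case) we have
$\operatorname{Cov}_{X_0 \sim \pi}(f(X_0), f(X_t)) \le (1 - \kappa)^t \operatorname{Var}_{X_0 \sim \pi} f(X_0)$.
In particular, for any fixed $i$ we have
$\operatorname{Var}_{X_0 \sim \pi}(f(X_i)) + 2 \sum_{j > i} \operatorname{Cov}_{X_0 \sim \pi}(f(X_i), f(X_j)) \le \frac{2}{\kappa} \operatorname{Var}_{X_0 \sim \pi} f(X_i)$.
Plugging this into the above yields
$\operatorname{Var}_{X_0 \sim \pi} \hat\pi(f) \le \frac{2}{\kappa T} \operatorname{Var}_\pi f$,
which explains the $1/\kappa T$ factor.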
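The constant $2/\kappa$ in this heuristic comes from a geometric series. As a worked step, assuming the covariance decay $\operatorname{Cov}(f(X_i), f(X_{i+t})) \le (1-\kappa)^t \operatorname{Var}_\pi f$ under stationarity:

```latex
\[
\operatorname{Var}(f(X_i)) + 2\sum_{j>i}\operatorname{Cov}(f(X_i), f(X_j))
\;\le\; \Bigl(1 + 2\sum_{t\ge 1}(1-\kappa)^t\Bigr)\operatorname{Var}_\pi f
\;=\; \Bigl(\frac{2}{\kappa} - 1\Bigr)\operatorname{Var}_\pi f
\;\le\; \frac{2}{\kappa}\operatorname{Var}_\pi f .
\]
% Summing over i = 1, ..., T and dividing by T^2:
\[
\operatorname{Var}_{X_0\sim\pi}\hat\pi(f)
\;\le\; \frac{1}{T^2}\cdot T\cdot\frac{2}{\kappa}\operatorname{Var}_\pi f
\;=\; \frac{2}{\kappa T}\operatorname{Var}_\pi f .
\]
```

The middle equality uses $\sum_{t \ge 1} (1-\kappa)^t = (1-\kappa)/\kappa$.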

Unbounded diffusion constant. In the formulas above, a supremum of
$\sigma(x)^2 / n_x$ appeared. This is fine when considering e.g. the simple
random walk on a graph, because then $\sigma(x)^2 / n_x \approx 1$ for all
$x \in \mathcal{X}$. However, in some situations (for instance binomial
distributions on the cube), this supremum is much larger than a typical value,
and, in some continuous-time limits on infinite spaces, the supremum may even
be infinite (as in the example of the M/M/$\infty$ queueing process below). For
such situations, one expects the variance to depend, asymptotically, on the
average of $\sigma(x)^2 / n_x$ under the invariant measure $\pi$, rather than its
supremum.

The next result generalizes Theorem 2 to Markov chains with unbounded diffusion
constant $\sigma(x)^2 / n_x$. We will assume that $\sigma(x)^2 / n_x$ has at most
linear growth (this is consistent with the usual theorems on Markov processes,
in which linear growth on the coefficients of the diffusion is usually assumed).

Of course, if one starts the Markov chain at a point $x$ with large
$\sigma(x)^2$, meaning that the chain has a large diffusion constant at the
beginning, then the variance of the empirical means started at $x$ will be
accordingly large, at least for small $T$. This gives rise, in the estimates
below, to a variance term depending on $x$; it so happens that this term
decreases like $1/T^2$ with time.

Theorem 3 (Variance of empirical means, 2). 

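Assume that there exists a Lipschitz function $S$ with
$\|S\|_{\mathrm{Lip}} \le C$ such that

$$\frac{\sigma(x)^2}{n_x \kappa} \le S(x), \qquad x \in \mathcal{X}.$$

Then the variance of the empirical mean is bounded as follows:

$$\operatorname{Var}_x \hat\pi(f) \le \begin{cases} \dfrac{\|f\|_{\mathrm{Lip}}^2}{\kappa T} \left( \mathbb{E}_\pi S + \dfrac{C}{\kappa T} E(x) \right) & \text{if } T_0 = 0; \\[2ex] \dfrac{\|f\|_{\mathrm{Lip}}^2}{\kappa T} \left( \left( 1 + \dfrac{1}{\kappa T} \right) \mathbb{E}_\pi S + \dfrac{2C (1 - \kappa)^{T_0}}{\kappa T} E(x) \right) & \text{otherwise.} \end{cases} \tag{7}$$

In particular, the upper bound behaves asymptotically like
$\|f\|_{\mathrm{Lip}}^2\, \mathbb{E}_\pi S / \kappa T$, with a correction of order
$1/(\kappa T)^2$ depending on the initial point $x$.

Note that in some situations, $\mathbb{E}_\pi S$ is known in advance for
theoretical reasons. In general, it is always possible to choose any origin
$x_0 \in \mathcal{X}$ (which may or may not be the initial point $x$) and apply
the estimate $\mathbb{E}_\pi S \le S(x_0) + C\, E(x_0)$.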



Concentration results. To get good confidence intervals for $\pi(f)$ it is
necessary to investigate deviations, i.e. the behavior of the probabilities

$$\mathbb{P}_x\left( |\hat\pi(f) - \pi(f)| > r \right),$$

which reduce to deviations of the centered empirical mean

$$\mathbb{P}_x\left( |\hat\pi(f) - \mathbb{E}_x \hat\pi(f)| > r \right)$$

if the bias is known. Of course the Bienaymé-Chebyshev inequality states that
$\mathbb{P}_x\left( |\hat\pi(f) - \mathbb{E}_x \hat\pi(f)| > r \right) \le \operatorname{Var}_x \hat\pi(f) / r^2$,
but this does not decrease very fast with $r$, and Gaussian-type deviation
estimates are often necessary to get good confidence intervals. Our next
results show that the probability of a deviation of size $r$ is bounded by an
explicit Gaussian or exponential term. (The same Gaussian-exponential
transition also appears in [Lez98] and other works.)

Of course, deviations for the function $10f$ are 10 times bigger than
deviations for $f$, so we will use the rescaled deviation
$\frac{\hat\pi(f) - \mathbb{E}_x \hat\pi(f)}{\|f\|_{\mathrm{Lip}}}$.

In the sequel, we assume that $\sigma_\infty < \infty$. Once more the proofs of the
following results are established in Section 3.

Theorem 4 (Concentration of empirical means, 1). 

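Denote by $V^2$ the quantity:



Then empirical means satisfy the following concentration result:

$$\mathbb{P}_x\left( \frac{|\hat\pi(f) - \mathbb{E}_x \hat\pi(f)|}{\|f\|_{\mathrm{Lip}}} \ge r \right) \le \begin{cases} 2\, e^{-r^2 / (16 V^2)} & \text{if } r \in (0, r_{\max}); \\ 2\, e^{-\kappa T r / (12 \sigma_\infty)} & \text{if } r \ge r_{\max} \end{cases} \tag{8}$$

where the boundary of the Gaussian window is given by
$r_{\max} := 4 V^2 \kappa T / (3 \sigma_\infty)$.

Note that $r_{\max} \gg 1/\sqrt{T}$ for large $T$, so that the Gaussian window
gets better and better when normalizing by the standard deviation. This is in
accordance with the central limit theorem for Markov chains [MT93].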
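The two regimes of Theorem 4 can be evaluated as follows. This sketch assumes the bound has the form $2 e^{-r^2/(16 V^2)}$ inside the Gaussian window and $2 e^{-\kappa T r/(12 \sigma_\infty)}$ beyond it, with $r_{\max} = 4 V^2 \kappa T/(3 \sigma_\infty)$. The quantity $V^2$ is taken as an input (`V2`) to be supplied by the caller, and all function names are hypothetical.

```python
import math


def gaussian_window(V2, kappa, T, sigma_inf):
    """r_max = 4 V^2 kappa T / (3 sigma_inf): boundary between the two regimes."""
    return 4.0 * V2 * kappa * T / (3.0 * sigma_inf)


def deviation_bound(r, V2, kappa, T, sigma_inf):
    """Upper bound on P_x(|pi_hat(f) - E_x pi_hat(f)| / ||f||_Lip >= r), for r > 0."""
    if r < gaussian_window(V2, kappa, T, sigma_inf):
        return 2.0 * math.exp(-r * r / (16.0 * V2))          # Gaussian regime
    return 2.0 * math.exp(-kappa * T * r / (12.0 * sigma_inf))  # exponential regime
```

The two expressions agree at $r = r_{\max}$, so the bound is continuous across the boundary of the Gaussian window.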

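As for the case of variance above, we also provide an estimate using the
average of $\sigma(x)^2 / n_x$ rather than its supremum.

Theorem 5 (Concentration of empirical means, 2).

Assume that there exists a Lipschitz function $S$ with
$\|S\|_{\mathrm{Lip}} \le C$ such that

$$\frac{\sigma(x)^2}{n_x \kappa} \le S(x), \qquad x \in \mathcal{X}.$$

Denote by $V_x^2$ the following term depending on the initial condition $x$:



Then the following concentration inequality holds:

$$\mathbb{P}_x\left( \frac{|\hat\pi(f) - \mathbb{E}_x \hat\pi(f)|}{\|f\|_{\mathrm{Lip}}} \ge r \right) \le \begin{cases} 2\, e^{-r^2 / (16 V_x^2)} & \text{if } r \in (0, r_{\max}); \\ 2\, e^{-\kappa T r / (4 \max\{2C,\, 3\sigma_\infty\})} & \text{if } r \ge r_{\max} \end{cases} \tag{9}$$

where $r_{\max} := 4 V_x^2 \kappa T / \max\{2C, 3\sigma_\infty\}$.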

The two quantities $V^2$ and $V_x^2$ in these theorems are essentially similar
to the estimates of the empirical variance $\operatorname{Var}_x \hat\pi(f)$
given in Theorems 2 and 3, so that the same comments apply.

Randomizing the starting point. As can be seen from the above, we are
mainly concerned with Markov chains starting from a deterministic point. The
random case might be treated as follows. Assume that the starting point $X_0$ of
the chain is taken at random according to some probability measure $\mu$. Then
an additional variance term appears in the variance/bias decomposition (3),
namely:

$$\mathbb{E}\left[ |\hat\pi(f) - \pi(f)|^2 \right] = \left| \mathbb{E}_{X_0 \sim \mu} \hat\pi(f) - \pi(f) \right|^2 + \int_{\mathcal{X}} \operatorname{Var}_x \hat\pi(f)\, \mu(dx) + \operatorname{Var}\left[ \mathbb{E}(\hat\pi(f) \mid X_0) \right].$$

The new variance term depends on how "spread" the initial distribution $\mu$ is
and can be easily bounded. Indeed we have

$$\mathbb{E}(\hat\pi(f) \mid X_0 = x) = \frac{1}{T} \sum_{k=T_0+1}^{T_0+T} P^k f(x)$$

so that if $f$ is, say, 1-Lipschitz,

$$\operatorname{Var}\left[ \mathbb{E}(\hat\pi(f) \mid X_0) \right] = \frac{1}{2T^2} \int_{\mathcal{X}} \int_{\mathcal{X}} \left| \sum_{k=T_0+1}^{T_0+T} \left( P^k f(x) - P^k f(y) \right) \right|^2 \mu(dx)\, \mu(dy) \le \frac{(1 - \kappa)^{2(T_0+1)}}{2 \kappa^2 T^2} \int_{\mathcal{X}} \int_{\mathcal{X}} d(x, y)^2\, \mu(dx)\, \mu(dy)$$

since $P^k f$ is $(1 - \kappa)^k$-Lipschitz. This is fast-decreasing both in
$T_0$ and $T$.

Note also that the bias can be significantly reduced if $\mu$ is known, for some
reason, to be close to the invariant measure $\pi$. More precisely, the
eccentricity $E(x)$ in the bias formula (4) above is replaced with the
transportation distance $W_1(\mu, \pi)$.



Convergence of $\hat\pi$ to $\pi$. The fact that $\hat\pi$ yields estimates close
to $\pi$ when integrating Lipschitz functions does not mean that $\hat\pi$ itself
is close to $\pi$. To see this, consider the simple case when $\mathcal{X}$ is a
set of $N$ elements equipped with any metric. Consider the trivial Markov chain
on $\mathcal{X}$ which sends every point $x \in \mathcal{X}$ to the uniform
probability measure on $\mathcal{X}$ (so that $\kappa = 1$ and the MCMC method
reduces to the ordinary Monte Carlo method). Then it is clear that for any
function $f$, the bias vanishes and the empirical variance is

$$\operatorname{Var}_x \hat\pi(f) = \frac{1}{T} \operatorname{Var}_\pi f$$

which in particular does not depend directly on $N$ and makes it possible to
estimate $\pi(f)$ with a sample size independent of $N$, as is well-known to any
statistician. But the empirical measure $\hat\pi$ is a sum of Dirac masses at $T$
points, so that its Wasserstein distance to the uniform measure cannot be small
unless $T$ is comparable to $N$.

This may seem to contradict the Kantorovich-Rubinstein duality theorem
[Vil03], which states that

$$W_1(\hat\pi, \pi) = \sup_{f\ 1\text{-Lipschitz}} \hat\pi(f) - \pi(f).$$

Indeed, we know that for a function $f$ fixed in advance, very probably
$\hat\pi(f)$ is close to $\pi(f)$. But for every realization of the random
measure $\hat\pi$ there may be a particular function $f$ yielding a large error.
What is true is that the averaged empirical measure $\mathbb{E}_x \hat\pi$
starting at $x$ tends to $\pi$ fast enough, namely

$$W_1(\mathbb{E}_x \hat\pi, \pi) \le \frac{(1 - \kappa)^{T_0 + 1}}{\kappa T}\, E(x)$$

which is just a restatement of our bias estimate above (Proposition 1). But as
we have just seen, $\mathbb{E}_x W_1(\hat\pi, \pi)$ is generally much larger.



2 Examples and applications 

We now show how these results can be applied to various settings where the
positive curvature assumption is satisfied, ranging from discrete product spaces
to waiting queues, diffusions on $\mathbb{R}^d$ or manifolds, and spin systems. In
several examples, our results improve on the literature.



A simple example: discrete product spaces. Let us first consider a very
simple example. This is mainly illustrative, as in this case the invariant
measure $\pi$ is very easy to simulate. Let $\mathcal{X} = \{0, 1\}^N$ be the space
of $N$-bit sequences equipped with the uniform probability measure. We shall use
the Hamming distance on $\mathcal{X}$, namely, the distance between two sequences
of 0's and 1's is the number of positions at which they differ. The Markov chain
we shall consider consists, at each step, in choosing a position
$1 \le i \le N$ at random, and replacing the $i$-th bit of the sequence with
either a 0 or a 1 with probability 1/2. Namely, starting at
$x = (x_1, \ldots, x_N)$ we have $P_x(x) = 1/2$ and
$P_x(x_1, \ldots, 1 - x_i, \ldots, x_N) = 1/2N$.

A typical Lipschitz function for this example is the function $f_0$ equal to the
proportion of "0" bits in the sequence, for which
$\|f_0\|_{\mathrm{Lip}} = 1/N$.

Then an elementary computation (Example 8 in [Oll09]) shows that
$\kappa = 1/N$, so that our theorems apply. The various quantities of interest
are estimated as $\sigma_\infty = 1$, $\sigma(x)^2 \le 2$ and $n_x \ge 1$; using
Remark 40 in [Oll09] yields a slightly better estimate
$\sigma(x)^2 / n_x \le 1/2$. Moreover $E(x) = N/2$ for any $x \in \mathcal{X}$.

Then our bias estimate (4) for a Lipschitz function $f$ is

$$\left| \mathbb{E}_x \hat\pi(f) - \pi(f) \right| \le \frac{N^2}{2T} (1 - 1/N)^{T_0 + 1}\, \|f\|_{\mathrm{Lip}} \le \frac{N^2}{2T}\, e^{-T_0/N}\, \|f\|_{\mathrm{Lip}}.$$

So taking $T_0 \approx 2N \log N$ is enough to ensure small bias. This estimate of
the mixing time is known to be the correct order of magnitude: indeed, if each
bit has been updated at least once (which occurs after a time
$\approx N \log N$) then the measure is exactly the invariant measure and so,
under this event, the bias exactly vanishes. In contrast, the classical estimate
using the spectral gap yields only $O(N^2)$ for the mixing time [DS96].
The variance estimate (6) reads

$$\operatorname{Var}_x \hat\pi(f) \le \frac{N^2}{2T} (1 + N/T)\, \|f\|_{\mathrm{Lip}}^2$$

so that, for example for the function $f_0$ above, taking $T \approx N$ will yield
a variance $\approx 1/N$, the same order of magnitude as the variance of $f_0$
under the uniform measure. (With a little work, one can convince oneself that
this order of magnitude is correct for large $T$.)

The concentration result (8) reads, say with $T_0 = 0$ and for the Gaussian part:

$$\mathbb{P}_x\left( |\hat\pi(f) - \mathbb{E}_x \hat\pi(f)| \ge r \right) \le 2\, e^{-T r^2 / (8 N^2 \|f\|_{\mathrm{Lip}}^2)}$$

so that e.g. for $f_0$ we simply get $2 e^{-T r^2/8}$. For comparison, the
spectral estimate from [Lez98] behaves like $2^{N/2} e^{-T r^2/4}$ for small
$r$, so that we roughly improve the estimate by a factor $2^{N/2}$, due to the
fact that the density of the law of the starting point (a Dirac mass) plays no
role in our setting.

Heat bath for the Ising model. Let $G$ be a finite graph. Consider the classical
Ising model from statistical mechanics [Mar04], namely the configuration space
$\mathcal{X} := \{-1, 1\}^G$ together with the energy function
$U(s) := -\sum_{x \sim y \in G} s(x) s(y) - h \sum_{x \in G} s(x)$ for
$s \in \mathcal{X}$, where $h \in \mathbb{R}$. For some $\beta \ge 0$, equip
$\mathcal{X}$ with the Gibbs distribution $\pi := e^{-\beta U}/Z$ where as usual
$Z := \sum_s e^{-\beta U(s)}$. The distance between two states is defined as the
number of vertices of $G$ at which their values differ, namely
$d(s, s') := \frac{1}{2} \sum_{x \in G} |s(x) - s'(x)|$.

For $s \in \mathcal{X}$ and $x \in G$, denote by $s_{x+}$ and $s_{x-}$ the states
obtained from $s$ by setting $s_{x+}(x) = +1$ and $s_{x-}(x) = -1$, respectively.
Consider the following random walk on $\mathcal{X}$, known as the heat bath or
Glauber dynamics [Mar04]: at each step, a vertex $x \in G$ is chosen at random,
and a new value for $s(x)$ is picked according to local equilibrium, i.e. $s(x)$
is set to $1$ or $-1$ with probabilities proportional to $e^{-\beta U(s_{x+})}$
and $e^{-\beta U(s_{x-})}$ respectively (note that only the neighbors of $x$
influence the ratio of these probabilities). The Gibbs distribution $\pi$ is
invariant (and reversible) for this Markov chain.

When $\beta = 0$, this Markov chain is identical to the Markov chain on
$\{0, 1\}^N$ described above, with $N = |G|$. Therefore, it comes as no surprise
that for $\beta$ small enough, curvature is positive. More precisely one finds
[Oll09]

$$\kappa \ge \frac{1}{|G|} \left( 1 - v_{\max}\, \frac{e^{\beta} - e^{-\beta}}{e^{\beta} + e^{-\beta}} \right)$$

where $v_{\max}$ is the maximal valency of a vertex of $G$. In particular, if
$\beta < \frac{1}{2} \ln\left( \frac{v_{\max} + 1}{v_{\max} - 1} \right)$ then
$\kappa$ is positive. This is not surprising, as the current research interest
in transportation distances can be traced back to [Dob70] (where the name
Vasershtein distance is introduced), in which a criterion for convergence of
spin systems is introduced. Dobrushin's criterion was a contraction property of
the Markov chain in Wasserstein distance, and thus, in this context, precisely
coincides with our notion of $\kappa > 0$. (See also [Per].)
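The curvature bound and the corresponding temperature threshold are easy to evaluate numerically. The sketch below assumes that the bound reads $\kappa \ge \frac{1}{|G|}\left(1 - v_{\max} \tanh \beta\right)$, which is the same expression written with the hyperbolic tangent, and that the threshold is $\beta < \frac12 \ln\frac{v_{\max}+1}{v_{\max}-1}$; the function names are hypothetical.

```python
import math


def ising_kappa_lower_bound(beta, v_max, n_vertices):
    """(1 / |G|) * (1 - v_max * tanh(beta)), the assumed lower bound on kappa."""
    return (1.0 - v_max * math.tanh(beta)) / n_vertices


def critical_beta(v_max):
    """Value of beta at which the bound vanishes: tanh(beta) = 1 / v_max."""
    if v_max <= 1:
        return math.inf  # tanh(beta) < 1 <= 1 / v_max for every finite beta
    return 0.5 * math.log((v_max + 1.0) / (v_max - 1.0))
```

At $\beta = 0$ the bound reduces to $1/|G|$, matching the bit-flip chain on $\{0,1\}^N$ with $N = |G|$; for a graph of maximal valency 4, `critical_beta(4)` is about 0.255.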

Let us see how our theorems apply, for example, to the magnetization
$f_0(s) := \frac{1}{|G|} \sum_{x \in G} s(x)$. With the metric we use we have
$\|f_0\|_{\mathrm{Lip}} = \frac{2}{|G|}$.

Let $\gamma := 1 - v_{\max} \frac{e^{\beta} - e^{-\beta}}{e^{\beta} + e^{-\beta}}$ so
that $\kappa = \frac{\gamma}{|G|}$, and assume that $\gamma > 0$. Using the gross
inequalities $\sigma_\infty = 1$, $\sigma(s)^2 \le 2$ and $n_s \ge 1$, the variance
estimate of Theorem 2 reads, with $T_0 = 0$:

$$\operatorname{Var}_s \hat\pi(f_0) \le \frac{8}{\gamma^2 T}$$

where $s$ is any initial configuration. For example, taking $T \approx |G|$ (i.e.
each site of $G$ is updated a few times by the heat bath) ensures that
$\operatorname{Var}_s \hat\pi(f_0)$ is of the same order of magnitude as the
variance of $f_0$ under the invariant measure.

Theorem 4 provides a Gaussian estimate for deviations, with similar variance
up to numerical constants. The transition from Gaussian to non-Gaussian regime
becomes very relevant when the external magnetic field $h$ is large, because
then the number of spins opposing the magnetic field has a Poisson-like rather
than Gaussian-like behavior (compare Section 3.3.3 in [Oll09]).

The bias is controlled as follows: using
$E(s) \le \operatorname{diam} \mathcal{X} = |G|$ in Proposition 1 one finds
$\left| \mathbb{E}_s \hat\pi(f_0) - \pi(f_0) \right| \le 2|G| (1 - \gamma/|G|)^{T_0} / (\gamma T)$
so that taking $T_0 \approx |G| \log |G|$ is a good choice.

These results are not easily compared with the literature, which often focuses
on getting non-explicit constants for systems of infinite size [Mar04]. However,
we have seen that even in the case $\beta = 0$ our estimates improve on the
spectral estimate, and our results provide very explicit bounds on the time
necessary to run a heat bath simulation, at least for $\beta$ not too large.



The M/M/$\infty$ queueing process. We now focus on a continuous-time example,
namely the M/M/$\infty$ queueing process. This is a continuous-time Markov chain
$(X_t)_{t \ge 0}$ on $\mathbb{N}$ with transition kernel given for small $t$ by

$$P_t(x, y) = \begin{cases} \lambda t + o(t) & \text{if } y = x + 1; \\ x t + o(t) & \text{if } y = x - 1; \\ 1 - (\lambda + x) t + o(t) & \text{if } y = x, \end{cases}$$

where $\lambda$ is a positive parameter. The (reversible) invariant measure is
the Poisson distribution $\pi$ on $\mathbb{N}$ with parameter $\lambda$. Although
this process is very simple in appearance, the unboundedness of the associated
transition rates makes the determination of concentration inequalities
technically challenging. Here we will get a convenient concentration inequality
for Lipschitz functions $f$ with respect to the classical metric on
$\mathbb{N}$, in contrast with the situation of [Jou] where Poisson-like
concentration estimates are provided for Lipschitz functions with respect to an
ad hoc metric. The techniques used here allow us to overcome this difficulty.

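First, let us consider, given $d \in \mathbb{N}^*$, $d > \lambda$, the so-called
binomial Markov chain $(X_N^{(d)})_{N \in \mathbb{N}}$ on $\{0, 1, \ldots, d\}$ with
transition probabilities given by

$$P_x^{(d)}(y) := \begin{cases} \dfrac{\lambda}{d} \left( 1 - \dfrac{x}{d} \right) & \text{if } y = x + 1; \\[1.5ex] \dfrac{x}{d} \left( 1 - \dfrac{\lambda}{d} \right) & \text{if } y = x - 1; \\[1.5ex] \dfrac{\lambda x}{d^2} + \left( 1 - \dfrac{\lambda}{d} \right) \left( 1 - \dfrac{x}{d} \right) & \text{if } y = x. \end{cases}$$

The invariant measure is the binomial distribution $\pi^{(d)}$ on
$\{0, 1, \ldots, d\}$ with parameters $d$ and $\lambda/d$. It is not difficult to
show that the Ricci curvature is $\kappa = 1/d$ and that
$\sigma(x)^2 \le (\lambda + x)/d$ for $x \in \{0, 1, \ldots, d\}$.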
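The value $\kappa = 1/d$ can be checked numerically on small instances. The sketch below assumes that the binomial chain moves up with probability $\frac{\lambda}{d}(1 - \frac{x}{d})$, down with probability $\frac{x}{d}(1 - \frac{\lambda}{d})$, and stays put otherwise. It uses the standard fact that on the integers with $d(x,y) = |x - y|$, $W_1$ is the sum of the absolute differences of the cumulative distribution functions, and it only compares neighboring points, which suffices on a path. All function names are hypothetical.

```python
from fractions import Fraction


def binomial_kernel(x, d, lam):
    """One-step law P_x of the binomial chain on {0, ..., d}, as a dict y -> probability."""
    a, u = Fraction(lam) / d, Fraction(x, d)
    law = {x: a * u + (1 - a) * (1 - u)}
    if x < d:
        law[x + 1] = a * (1 - u)
    if x > 0:
        law[x - 1] = u * (1 - a)
    return law


def w1_on_integers(mu, nu, d):
    """W1 on {0, ..., d} with d(x, y) = |x - y|: sum over k < d of |F_mu(k) - F_nu(k)|."""
    total, f_mu, f_nu = Fraction(0), Fraction(0), Fraction(0)
    for k in range(d):
        f_mu += mu.get(k, 0)
        f_nu += nu.get(k, 0)
        total += abs(f_mu - f_nu)
    return total


def curvature(d, lam):
    """kappa = 1 - max over neighbouring pairs of W1(P_x, P_{x+1})."""
    return 1 - max(
        w1_on_integers(binomial_kernel(x, d, lam), binomial_kernel(x + 1, d, lam), d)
        for x in range(d)
    )
```

Exact rational arithmetic avoids rounding issues, so for instance `curvature(10, 3)` can be compared with `Fraction(1, 10)` directly.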

But now, take instead the continuous-time version of the above, namely the
Markov process $(X_t^{(d)})_{t \ge 0}$ whose transition kernel is defined for any
$t \ge 0$ as

$$P_t^{(d)}(x, y) = \sum_{k=0}^{+\infty} e^{-t}\, \frac{t^k}{k!}\, (P_x^{(d)})^k(y), \qquad x, y \in \{0, 1, \ldots, d\}.$$

As $d \to \infty$, the invariant measure $\pi^{(d)}$ converges weakly to the
Poisson measure $\pi$, which is nothing but the invariant measure of the
M/M/$\infty$ queueing process. One can check (using e.g. Theorem 4.8 in [JS03])
that the process $(X_t^{(d)})_{t \ge 0}$ sped up by a factor $d$ converges to the
M/M/$\infty$ queueing process $(X_t)_{t \ge 0}$ in a suitable sense (in the
Skorokhod space of càdlàg functions equipped with the Skorokhod topology).

To derive a concentration inequality for the empirical mean
$\hat\pi(f) := t^{-1} \int_0^t f(X_s)\, ds$, where $f$ is 1-Lipschitz on
$\mathbb{N}$ and time $t$ is fixed, we proceed as follows. First, we will obtain
a concentration estimate for the continuous-time binomial Markov chain
$(X_t^{(d)})_{t \ge 0}$ by using Theorem 5 for the chain
$(X_{\varepsilon N}^{(d)})_{N \in \mathbb{N}}$ with $\varepsilon \to 0$, and then we
will approximate the M/M/$\infty$ queueing process $(X_t)_{t \ge 0}$ by the
sped-up process $(X_{td}^{(d)})_{t \ge 0}$ with $d \to \infty$.

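For small $\varepsilon$, the Markov chain
$(X_{\varepsilon N}^{(d)})_{N \in \mathbb{N}}$ has Ricci curvature bounded below by
$\varepsilon/d$, eccentricity $E(x) \le x + E(0) = x + \lambda$, square diffusion
constant $\sigma(x)^2$ of order $\varepsilon(\lambda + x)/d$, and $n_x \ge 1$, so
that we may take $S(x) := \lambda + x$ in Theorems 3 and 5 above (with
$T_0 = 0$ for simplicity). Let $f$ be a 1-Lipschitz function. For a given
$t > 0$ we have $\mathbb{P}_x$-almost surely the Riemann approximation:

$$\hat\pi^{(d)}(f) := \frac{1}{t} \int_0^t f(X_s^{(d)})\, ds = \lim_{T \to +\infty} \hat\pi^{(d),T}(f)$$

where $\hat\pi^{(d),T}(f) := \frac{1}{T} \sum_{k=1}^{T} f(X_{kt/T}^{(d)})$. So
applying Theorem 5 to the Markov chain
$(X_{\varepsilon N}^{(d)})_{N \in \mathbb{N}}$ with $\varepsilon = t/T$, we get by
Fatou's lemma:

$$\mathbb{P}_x\left( \left| \hat\pi^{(d)}(f) - \mathbb{E}_x \hat\pi^{(d)}(f) \right| > r \right) \le \liminf_{T \to +\infty} \mathbb{P}_x\left( \left| \hat\pi^{(d),T}(f) - \mathbb{E}_x \hat\pi^{(d),T}(f) \right| > r \right) \le \begin{cases} 2\, e^{-\frac{t^2 r^2}{16 d (2 \lambda t + (\lambda + x) d)}} & \text{if } r \in (0, r_{\max}^{(d)}); \\ 2\, e^{-\frac{t r}{12 d}} & \text{if } r \ge r_{\max}^{(d)} \end{cases}$$

where $r_{\max}^{(d)} := (8 \lambda t + 4 (\lambda + x) d)/3t$. Finally, we
approximate $(X_t)_{t \ge 0}$ by the sped-up process $(X_{td}^{(d)})_{t \ge 0}$ with
$d \to \infty$ and apply Fatou's lemma again to obtain the following.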

Corollary 6. 

Let $(X_s)_{s \ge 0}$ be the M/M/$\infty$ queueing process with parameter
$\lambda$. Let $f : \mathbb{N} \to \mathbb{R}$ be a 1-Lipschitz function. Then for
any $t > 0$, the empirical mean $\hat\pi(f) := t^{-1} \int_0^t f(X_s)\, ds$ under
the process starting at $x \in \mathbb{N}$ satisfies the concentration inequality

$$\mathbb{P}_x\left( |\hat\pi(f) - \mathbb{E}_x \hat\pi(f)| > r \right) \le \begin{cases} 2\, e^{-\frac{t r^2}{16 (2 \lambda + (\lambda + x)/t)}} & \text{if } r \in (0, r_{\max}); \\ 2\, e^{-\frac{t r}{12}} & \text{if } r \ge r_{\max} \end{cases}$$

where $r_{\max} := (8 \lambda t + 4 (\lambda + x))/3t$.

Let us mention that a somewhat similar, albeit much less explicit, concentration
inequality has been derived in [GLWY] via transportation-information inequalities
and a drift condition of Lyapunov-type.

Our results generalize to other kinds of waiting queues, such as queues with a
finite number of servers and positive abandon rate.
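Corollary 6 is fully explicit, so it can be turned into a small numerical helper. The sketch below assumes the bound $2 \exp\left(-\frac{t r^2}{16(2\lambda + (\lambda + x)/t)}\right)$ for $r < r_{\max}$ and $2 \exp(-t r/12)$ beyond, with $r_{\max} = (8 \lambda t + 4(\lambda + x))/(3t)$; the function names are hypothetical.

```python
import math


def mm_inf_r_max(t, lam, x):
    """r_max = (8 lam t + 4 (lam + x)) / (3 t), the boundary of the Gaussian window."""
    return (8.0 * lam * t + 4.0 * (lam + x)) / (3.0 * t)


def mm_inf_deviation_bound(r, t, lam, x):
    """Bound on P_x(|pi_hat(f) - E_x pi_hat(f)| > r) for a 1-Lipschitz f, with r > 0."""
    if r < mm_inf_r_max(t, lam, x):
        return 2.0 * math.exp(-t * r * r / (16.0 * (2.0 * lam + (lam + x) / t)))
    return 2.0 * math.exp(-t * r / 12.0)
```

As $t$ grows, the dependence on the starting point $x$ fades from both $r_{\max}$ and the Gaussian exponent, and the two branches coincide at $r = r_{\max}$.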



Euler scheme for diffusions. Let $(X_t)_{t \ge 0}$ be the solution of the
following stochastic differential equation on the Euclidean space
$\mathbb{R}^d$:

$$dX_t = b(X_t)\, dt + \sqrt{2}\, \rho(X_t)\, dW_t$$

where $(W_t)_{t \ge 0}$ is a standard Brownian motion in $\mathbb{R}^d$, the
function $b : \mathbb{R}^d \to \mathbb{R}^d$ is measurable, as is the $d \times d$
matrix-valued function $\rho$. For a given matrix $A$, we define the
Hilbert-Schmidt norm $\|A\|_{\mathrm{HS}} := \sqrt{\operatorname{tr} A A^*}$ and the
operator norm $\|A\|_{\mathbb{R}^d} := \sup_{v \neq 0} \frac{\|A v\|}{\|v\|}$.

We assume that the following stability condition [BHW97, DGW04] is satisfied:

(C) the functions $b$ and $\rho$ are Lipschitz, and there exists $\alpha > 0$
such that

$$\|\rho(x) - \rho(y)\|_{\mathrm{HS}}^2 + \langle x - y,\, b(x) - b(y) \rangle \le -\alpha\, \|x - y\|^2, \qquad x, y \in \mathbb{R}^d.$$

A typical example is the Ornstein-Uhlenbeck process, defined by
$\rho = \mathrm{Id}$ and $b(x) = -x$. As we will see, this assumption implies that
$\kappa > 0$.

The application of Theorems 3, 4 and 5 on this example requires careful
approximation arguments (see below). The result is the following.

Corollary 7. 

Let 7r(/) := J^^q f{Xs) ds be the empirical mean of the l-Lipschitz function 
/ : M'^ ^ M under the diffusion process {Xt)t'^o above, starting at point x. Let 
5 : A" — > M be a C-Lipschitz function with S{x) ^ ^ 11/0(2^) llRd. Set 

,2 1 „ , , CE{x) 



at a^t"^ 
Then one has Var^ 7r(/) ^ and 



(|7r(/)-E,7r(/)| >r) ^ <^ 



„2 



2 e «^ if r G (0, 



atr 



2e 8c if r ^max 



where v^g^^ : — 2V,^at/C. 

An interesting case is when $\rho$ is constant or bounded, in which case one can 
take $S(x) := \sup_x \frac{2 \|\rho(x)\|^2_{\mathbb{R}^d}}{\alpha}$, so that $C = 0$. Then $r_{\max} = \infty$ and the exponential 
regime disappears. For this particular case our result is comparable to [GLWY], 
but note however that their result requires some regularity on the distribution of 
the starting point of the process, in contrast with ours. 
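
Corollary 7 can be read as a two-regime tail bound. The following is a minimal Python sketch, assuming the bound has the form $2e^{-r^2/(16V_x^2)}$ for $r < r_{\max}$ and $2e^{-\alpha t r/(8C)}$ for $r \geq r_{\max}$, with $r_{\max} = 2V_x^2\alpha t/C$. The function name `deviation_bound` and its parameter names are illustrative and do not come from the paper.

```python
import math

def deviation_bound(r, v2, alpha, t, c):
    """Upper bound on P_x(|pi_hat(f) - E_x pi_hat(f)| > r) in the form of Corollary 7.

    v2 is V_x^2, alpha the constant of condition (C), t the time horizon,
    and c the Lipschitz constant C of S (all supplied by the caller).
    """
    gaussian = 2.0 * math.exp(-r * r / (16.0 * v2))
    if c == 0:
        # C = 0: r_max is infinite, so only the Gaussian regime remains.
        return gaussian
    r_max = 2.0 * v2 * alpha * t / c
    if r < r_max:
        return gaussian
    # Exponential regime for large deviations.
    return 2.0 * math.exp(-alpha * t * r / (8.0 * c))
```

The two branches agree at $r = r_{\max}$, since there $r^2/(16V_x^2) = \alpha t r/(8C)$, so the bound is continuous in $r$.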

Note that the final result features the average of $S$ under the invariant 
distribution. Sometimes this value is known from theoretical reasons, but in any case 
the assumption (C) implies very explicit bounds on the expectation of $d(x_0, x)^2$ 
under the invariant measure $\pi$ [BHW97], which can be used to bound $E_\pi S$ knowing 
$S(x_0)$, as well as to bound $E(x)$. 

The Lipschitz growth of $\|\rho\|^2_{\mathbb{R}^d}$ allows one to treat stochastic differential equations 
where the diffusion constant $\rho$ grows like $\sqrt{x}$, such as naturally appear in 
population dynamics or superprocesses. 
Proof. 

Consider the underlying Euler scheme with (small) constant step $\delta t$ for the 
stochastic differential equation above, i.e. the Markov chain $(X_N^{(\delta t)})_{N \in \mathbb{N}}$ defined by 

$$X_{N+1}^{(\delta t)} = X_N^{(\delta t)} + b(X_N^{(\delta t)})\,\delta t + \sqrt{2\,\delta t}\;\rho(X_N^{(\delta t)})\, Y_N$$

where $(Y_N)$ is any sequence of i.i.d. standard Gaussian random vectors. When 
$\delta t \to 0$, this process tends to a weak solution of the stochastic differential equation 
[BHW97]. 
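
As an illustration of the scheme above, here is a minimal one-dimensional Python sketch that runs the Euler recursion and averages $f$ along the trajectory. The function name `euler_empirical_mean`, the seed, and the numerical parameters are assumptions made for the example; it is tried on the Ornstein-Uhlenbeck case $b(x) = -x$, $\rho = 1$, whose invariant law is the standard Gaussian, so that $\pi(f) = 0$ for $f(x) = x$.

```python
import math
import random

def euler_empirical_mean(b, rho, f, x0, t, dt, rng):
    """Approximate (1/t) * integral_0^t f(X_s) ds with a 1-D Euler scheme."""
    n_steps = int(round(t / dt))
    x = x0
    total = 0.0
    for _ in range(n_steps):
        # One Euler step: X <- X + b(X) dt + sqrt(2 dt) rho(X) Y, with Y ~ N(0, 1).
        x = x + b(x) * dt + math.sqrt(2.0 * dt) * rho(x) * rng.gauss(0.0, 1.0)
        total += f(x)
    return total / n_steps

# Ornstein-Uhlenbeck example: b(x) = -x, rho = 1, and f(x) = x (1-Lipschitz).
rng = random.Random(0)
estimate = euler_empirical_mean(lambda x: -x, lambda x: 1.0, lambda x: x,
                                x0=3.0, t=200.0, dt=0.01, rng=rng)
```

With these parameters the estimate should land close to $0$, up to fluctuations of order $1/\sqrt{t}$.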

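Let us see how Theorems 3, 4 and 5 may be applied. The measure $P_x$ is 
a Gaussian with expectation $x + b(x)\,\delta t$ and covariance matrix $2\,\delta t\,\rho\rho^*(x)$. Let 
$(X_N^{(\delta t)}(x))_{N \in \mathbb{N}}$ be the chain starting at $x$. Under (C), we have 

$$
\begin{aligned}
E\,\big\|X_1^{(\delta t)}(x) - X_1^{(\delta t)}(y)\big\|^2
&= \|x - y\|^2 + 2\,\delta t\, \langle x - y,\, b(x) - b(y) \rangle \\
&\quad + 2\,\delta t\, \|\rho(x) - \rho(y)\|_{HS}^2 + \delta t^2\, \|b(x) - b(y)\|^2 \\
&\leq \|x - y\|^2 \big(1 - \alpha\,\delta t + O(\delta t^2)\big)^2,
\end{aligned}
$$

so that we obtain $\kappa \geq \alpha\,\delta t + O(\delta t^2)$. Moreover, the diffusion constant $\sigma(x)$ is given 
by 

$$\sigma(x)^2 = \frac{1}{2} \iint \|y - z\|^2\, P_x(dy)\, P_x(dz) = 2\,\delta t\, \|\rho(x)\|_{HS}^2$$

by a direct computation. 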
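
To make these two computations concrete, here is a short worked example for the Ornstein-Uhlenbeck case $\rho = \mathrm{Id}$, $b(x) = -x$ (a sketch under the assumption that both chains are driven by the same Gaussian vector, which is the coupling implicit in the expectation above).

```latex
% Condition (C) for rho = Id, b(x) = -x: the Hilbert-Schmidt term vanishes, so alpha = 1.
\|\rho(x) - \rho(y)\|_{HS}^2 + \langle x - y,\, b(x) - b(y) \rangle
  = 0 - \|x - y\|^2 .

% One Euler step with a common Gaussian vector Y_0: the noise cancels in the difference.
X_1^{(\delta t)}(x) - X_1^{(\delta t)}(y) = (1 - \delta t)(x - y),
\qquad
E\,\big\|X_1^{(\delta t)}(x) - X_1^{(\delta t)}(y)\big\|^2 = (1 - \delta t)^2 \|x - y\|^2 ,

% hence kappa >= delta t, with no O(delta t^2) correction in this case.

% Diffusion constant: the covariance is 2 delta t Id, whose trace is 2 d delta t.
\sigma(x)^2 = 2\,\delta t\, \|\mathrm{Id}\|_{HS}^2 = 2\, d\, \delta t .
```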

Next, using the Poincaré inequality for Gaussian measures in $\mathbb{R}^d$, with a little 
work one gets that the local dimension is 

$$n_x = \frac{\|\rho(x)\|_{HS}^2}{\|\rho(x)\|_{\mathbb{R}^d}^2}.$$

For example, if $\rho$ is the $d \times d$ identity matrix we have $n_x = d$, whereas $n_x = 1$ if $\rho$ 
is of rank 1. 
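
The ratio $\|\rho\|_{HS}^2 / \|\rho\|_{\mathbb{R}^d}^2$ is easy to evaluate numerically. Below is a minimal pure-Python sketch; the helper names are hypothetical, and the squared operator norm is approximated by power iteration on $\rho^*\rho$, which assumes a nonzero matrix and a start vector not orthogonal to the top eigenvector.

```python
def hs_norm_sq(a):
    """Squared Hilbert-Schmidt norm: sum of squared entries."""
    return sum(v * v for row in a for v in row)

def op_norm_sq(a, iters=200):
    """Squared operator norm: largest eigenvalue of A^T A, by power iteration."""
    n_rows, n_cols = len(a), len(a[0])
    v = [1.0 + 0.1 * j for j in range(n_cols)]  # generic start vector
    lam = 0.0
    for _ in range(iters):
        av = [sum(a[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
        w = [sum(a[i][j] * av[i] for i in range(n_rows)) for j in range(n_cols)]
        norm_w = sum(x * x for x in w) ** 0.5
        if norm_w == 0.0:
            return 0.0
        lam = norm_w / (sum(x * x for x in v) ** 0.5)  # Rayleigh-type estimate
        v = [x / norm_w for x in w]
    return lam

def local_dimension(rho):
    """n_x = ||rho||_HS^2 / ||rho||_op^2 for a nonzero matrix rho."""
    return hs_norm_sq(rho) / op_norm_sq(rho)

identity3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
rank_one = [[1.0, 2.0], [2.0, 4.0]]
```

Here `local_dimension(identity3)` recovers the dimension $3$, while `local_dimension(rank_one)` gives $1$, matching the two cases mentioned in the text.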



So we get that $\frac{\sigma(x)^2}{n_x \kappa}$ is bounded by the function 

$$S(x) := \frac{2}{\alpha}\, \|\rho(x)\|^2_{\mathbb{R}^d} + O(\delta t).$$

However, here we have $\sigma_\infty = \infty$. This can be circumvented either by directly 
plugging into Lemma 10 the well-known Laplace transform estimate for Lipschitz 
functions of Gaussian variables, or by slightly changing the approximation scheme 
as follows. Let us assume that $\sup_x \|\rho(x)\|_{\mathbb{R}^d} < \infty$. Now, replace the Gaussian 
random vectors $Y_N$ with random vectors whose law is supported in a large ball 
of radius $R$ and approximates a Gaussian (the convergence theorems of [BHW97] 
cover this situation as well). Then we have $\sigma_\infty = R \sqrt{2\,\delta t}\, \sup_x \|\rho(x)\|_{\mathbb{R}^d}$. This 
modifies the quantities $\sigma(x)^2$ and $\rho(x)$ by a factor at most $1 + o(1)$ as $R \to \infty$. 

Therefore, provided $S$ is Lipschitz, we can apply Theorem 5 to the empirical 
mean $\hat\pi(f) := t^{-1} \int_{s=0}^{t} f(X_s)\,ds$ by using the Euler scheme at time $T = t/\delta t$ with 
$\delta t \to 0$, and using Fatou's lemma as we did above for the case of the M/M/$\infty$ 
process. Note in particular that $\sigma_\infty \to 0$ as $\delta t \to 0$, so that $\sigma_\infty$ will disappear 
from $r_{\max}$ in the final result. 

Finally, the constraint $\sup_x \|\rho(x)\|_{\mathbb{R}^d} < \infty$ can be lifted by considering that, 
under our Lipschitz growth assumptions on $b$ and $\|\rho(x)\|^2_{\mathbb{R}^d}$, with arbitrarily high 
probability the process does not leave a compact set and so, up to an arbitrarily 
small error, the deviation probabilities considered depend only on the behavior of 
$\rho$ and $b$ in a compact set. □ 
Diffusions on positively curved manifolds. Consider a diffusion process 
$(X_t)_{t \geq 0}$ on a smooth, compact $N$-dimensional Riemannian manifold $M$, given by 
the stochastic differential equation 

$$dX_t = b\,dt + \sqrt{2}\,dB_t$$

with infinitesimal generator 

$$L := \Delta + b \cdot \nabla$$

where $b$ is a vector field on $M$, $\Delta$ is the Laplace-Beltrami operator and $B_t$ is the 
standard Brownian motion in the Riemannian manifold $M$. The Ricci curvature 
of this operator in the Bakry-Émery sense [BE85], applied to a tangent vector $v$, 
is $\mathrm{Ric}(v, v) - v \cdot \nabla_v b$ where $\mathrm{Ric}$ is the usual Ricci tensor. Assume that this quantity 
is at least $K$ for any unit tangent vector $v$. 

Consider as above the Euler approximation scheme at time $\delta t$ for this stochastic 
differential equation: starting at a point $x$, follow the flow of $b$ for a time $\delta t$, to 
obtain a point $x'$; now take a random tangent vector $w$ at $x'$ whose law is a 
Gaussian in the tangent plane at $x'$ with covariance matrix equal to the metric, 
and follow the geodesic generated by $w$ for a time $\sqrt{2\,\delta t}$. Define $P_x$ to be the law of 
the point so obtained. When $\delta t \to 0$, this Markov chain approximates the process 
$(X_t)_{t \geq 0}$ (see e.g. Section 1.4 in [Bis81]). Just as above, the Gaussian law 
actually has to be truncated to a large ball so that $\sigma_\infty < \infty$. 

For this Euler approximation, we have $\kappa \geq K\,\delta t + O(\delta t^{3/2})$ where $K$ is a lower 
bound for the Ricci-Bakry-Émery curvature [Oll09]. We have $\sigma(x)^2 = 2N\,\delta t + O(\delta t^{3/2})$ 
and $n_x = N + O(\sqrt{\delta t})$. The details are omitted, as they are very similar to the case 
of $\mathbb{R}^d$ above, except that in a neighborhood of size $\sqrt{\delta t}$ of a given point, distances 
are distorted by a factor $1 \pm O(\sqrt{\delta t})$ w.r.t. the Euclidean case. We restrict the 
statement to compact manifolds so that the constants hidden in the $O(\cdot)$ notation 
are uniform in $x$. 

So applying Theorem 4 to the Euler scheme at time $T = t/\delta t$ we get: 

Corollary 8. 

Let $(X_t)_{t \geq 0}$ be a process as above on a smooth, compact $N$-dimensional 
Riemannian manifold $\mathcal{X}$, with Bakry-Émery curvature at least $K > 0$. Let 
$\hat\pi(f) := t^{-1} \int_{s=0}^{t} f(X_s)\,ds$ be the empirical mean of the 1-Lipschitz function $f : \mathcal{X} \to \mathbb{R}$ 
under the diffusion process $(X_t)$ starting at some point $x \in \mathcal{X}$. Then 

$$P_x\big(|\hat\pi(f) - E_x \hat\pi(f)| > r\big) \leq 2\, e^{-K^2 t r^2 / 32}.$$

Once more, a related estimate appears in [GLWY], except that their result 
features an additional factor $\|d\beta/d\pi\|_2$ where $\beta$ is the law of the initial point of 
the Markov chain and $\pi$ is the invariant distribution, thus preventing it from being 
applied with $\beta$ a Dirac measure at $x$. 
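
A bound of this Gaussian type can be inverted to give a confidence half-width. The sketch below assumes the tail bound $2e^{-K^2 t r^2/32}$ and solves it for $r$ at a prescribed failure probability; the function name `half_width` and the parameter `delta` are illustrative choices, not notation from the paper.

```python
import math

def half_width(delta, curvature_k, t):
    """Smallest r such that 2 * exp(-K^2 * t * r^2 / 32) <= delta (0 < delta < 2)."""
    return math.sqrt(32.0 * math.log(2.0 / delta) / (curvature_k ** 2 * t))

# Example: curvature lower bound K = 1, time horizon t = 1000, level delta = 0.05.
r = half_width(0.05, 1.0, 1000.0)
```

The half-width shrinks like $1/(K\sqrt{t})$, and it is centered at $E_x \hat\pi(f)$ rather than at $\pi(f)$, so the bias has to be controlled separately.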
Non-linear state space models. Given a Polish state space $(\mathcal{X}, d)$, we consider 
the Markov chain $(X_N)_{N \in \mathbb{N}}$ solution of the following equation 

$$X_{N+1} = F(X_N, W_{N+1})$$

which models a noisy dynamical system. Here $(W_N)_{N \in \mathbb{N}}$ is a sequence of i.i.d. 
random variables with values in some parameter space, with common distribution 
$\mu$. We assume that there exists some $r < 1$ such that 

$$E\, d\big(F(x, W_1), F(y, W_1)\big) \leq r\, d(x, y), \qquad x, y \in \mathcal{X}, \qquad (10)$$

and that moreover the following function is $L^2$-Lipschitz on $\mathcal{X}$: 

$$x \mapsto E\big[d\big(F(x, W_1), F(x, W_2)\big)^2\big].$$

Note that the assumption (10) already appears in [DGW04] (Condition (3.3)) to 
study the propagation of Gaussian concentration to path-dependent functionals of 
$(X_N)_{N \in \mathbb{N}}$. 

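Since the transition probability $P_x$ is the image measure of $\mu$ by the function 
$F(x, \cdot)$, it is straightforward that the Ricci curvature $\kappa$ is at least $1 - r$, which is 
positive. Hence we may apply Theorem 3 with the $L^2/(2(1 - r))$-Lipschitz function 

$$S(x) := \frac{1}{2(1 - r)}\, E\big[d\big(F(x, W_1), F(x, W_2)\big)^2\big],$$

to obtain the variance inequality: 

$$\sup_{\|f\|_{Lip} \leq 1} \mathrm{Var}_x\, \hat\pi(f) \leq
\begin{cases}
\dfrac{1}{(1 - r) T} \left\{ E_\pi S + \dfrac{L^2}{2 (1 - r)^2 T}\, E(x) \right\} & \text{if } T_0 = 0; \\[2ex]
\dfrac{1 + \frac{T_0}{T}}{(1 - r) T} \left\{ E_\pi S + \dfrac{L^2\, r^{T_0}}{(1 - r)^2 T}\, E(x) \right\} & \text{otherwise.}
\end{cases}$$

Note that to obtain a qualitative concentration estimate via Theorem 5, we need 
the additional assumption $\sigma_\infty < \infty$, which depends on the properties of $\mu$ and of 
the function $F(x, \cdot)$ and states that at each step the noise has a bounded influence. 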
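
As a sanity check on these assumptions, here is a short worked example for a linear autoregressive model on $\mathbb{R}$. The model, the coefficient $a$ and the noise variance $s^2$ are assumptions chosen for illustration and do not appear in the paper.

```latex
% Hypothetical model: F(x, w) = a x + w on R with |a| < 1, d(x, y) = |x - y|,
% and W_1, W_2 i.i.d. centered with variance s^2.

% Contraction assumption (10): the noise cancels, so r = |a|.
E\, d\big(F(x, W_1), F(y, W_1)\big) = |a|\,|x - y| .

% Second assumption: the map is constant in x, hence Lipschitz with L = 0.
E\big[ d\big(F(x, W_1), F(x, W_2)\big)^2 \big] = E\,(W_1 - W_2)^2 = 2 s^2 .

% Resulting function S, constant on the whole space:
S(x) = \frac{1}{2(1 - |a|)} \cdot 2 s^2 = \frac{s^2}{1 - |a|} ,
\qquad
E_\pi S = \frac{s^2}{1 - |a|} .
```

Since $L = 0$ here, the term involving $E(x)$ drops out of the variance bound, which then depends on the starting point only through the burn-in $T_0$.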



3 Proofs 

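3.1 Proof of Proposition 1 

Let $f$ be a 1-Lipschitz function. Let us recall from [Oll09] that for $k \in \mathbb{N}$, the 
function $P^k f$ is $(1 - \kappa)^k$-Lipschitz. Then we have by the invariance of $\pi$: 

$$
\begin{aligned}
|E_x \hat\pi(f) - \pi(f)|
&= \frac{1}{T} \left| \sum_{k=T_0+1}^{T_0+T} \int_{\mathcal{X}} \big(P^k f(x) - P^k f(y)\big)\, \pi(dy) \right| \\
&\leq \frac{1}{T} \sum_{k=T_0+1}^{T_0+T} (1 - \kappa)^k \int_{\mathcal{X}} d(x, y)\, \pi(dy) \\
&\leq \frac{(1 - \kappa)^{T_0+1}}{\kappa T} \int_{\mathcal{X}} d(x, y)\, \pi(dy),
\end{aligned}
$$

so that we obtain the result. 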
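
The final bound decays geometrically in the burn-in $T_0$, which suggests a simple way to pick $T_0$ for a target bias. The Python sketch below assumes the bound $(1-\kappa)^{T_0+1} E(x)/(\kappa T)$, with $E(x)$ standing for $\int d(x,y)\,\pi(dy)$ (or any upper bound on it) supplied by the user; the function names are hypothetical.

```python
def bias_bound(kappa, t0, t, e_x):
    """Bias bound (1 - kappa)^(T0 + 1) / (kappa * T) * E(x), for 0 < kappa <= 1."""
    return (1.0 - kappa) ** (t0 + 1) / (kappa * t) * e_x

def burn_in_for_bias(kappa, t, e_x, eps):
    """Smallest T0 >= 0 whose bias bound is at most eps (requires eps > 0)."""
    t0 = 0
    while bias_bound(kappa, t0, t, e_x) > eps:
        t0 += 1
    return t0
```

For instance, with `kappa=0.5`, `t=10` and `e_x=4.0` the bound equals `0.4` at `T0 = 0` and halves with each extra burn-in step.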
3.2 Proof of Theorems 2 and 3 

Let us start with a variance-type result under the measure after $N$ steps. The 
proof relies on a simple induction argument and is left to the reader. 

Lemma 9. 

For any $N \in \mathbb{N}^*$ and any Lipschitz function $f$ on $\mathcal{X}$, we have: 

$$P^N(f^2) - (P^N f)^2 \leq \|f\|_{Lip}^2 \sum_{k=0}^{N-1} (1 - \kappa)^{2(N-1-k)}\, P^k\!\left(\frac{\sigma^2}{n}\right). \qquad (11)$$

In particular if the rate $x \mapsto \sigma(x)^2 / n_x$ is bounded then letting $N$ tend to infinity 
above entails a variance estimate under the invariant measure $\pi$: 

$$\mathrm{Var}_\pi f \leq \|f\|_{Lip}^2\, \sup_{x \in \mathcal{X}} \frac{\sigma(x)^2}{n_x \kappa}. \qquad (12)$$
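
The passage from (11) to (12) rests on a geometric series. A short sketch of that step, writing $M$ for $\sup_x \sigma(x)^2/n_x$ (a shorthand introduced here only for the computation):

```latex
% Each term P^k(sigma^2 / n) is at most M, so the sum in (11) is bounded by
\sum_{k=0}^{N-1} (1 - \kappa)^{2(N-1-k)}\, M
  = M \sum_{j=0}^{N-1} (1 - \kappa)^{2j}
  \leq \frac{M}{1 - (1 - \kappa)^2}
  = \frac{M}{\kappa (2 - \kappa)}
  \leq \frac{M}{\kappa},
  \qquad 0 < \kappa \leq 1 .

% Letting N tend to infinity in (11) then gives (12):
\mathrm{Var}_\pi f \leq \|f\|_{Lip}^2\, \frac{M}{\kappa} .
```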



Now we are able to prove the variance bounds of Theorems 2 and 3. Given a 
1-Lipschitz function $f$, consider the functional 

$$f_{x_1, \ldots, x_{T-1}}(x_T) := \frac{1}{T} \sum_{k=1}^{T} f(x_k),$$

the other coordinates $x_1, \ldots, x_{T-1}$ being fixed. The function $f_{x_1, \ldots, x_{T-1}}$ is $1/T$-Lipschitz, 
hence $1/(\kappa T)$-Lipschitz since $\kappa \leq 1$. Moreover, for each $k \in \{T-1, T-2, \ldots, 2\}$, 
define by a downward induction the conditional expectation of $\hat\pi(f)$ 
knowing $X_1 = x_1, \ldots, X_k = x_k$: 

$$f_{x_1, \ldots, x_{k-1}}(x_k) := \int_{\mathcal{X}} f_{x_1, \ldots, x_k}(x_{k+1})\, P_{x_k}(dx_{k+1})$$

and 

$$f_\emptyset(x_1) := \int_{\mathcal{X}} f_{x_1}(x_2)\, P_{x_1}(dx_2).$$

By Lemma 3.2 (step 1) in [Jou], we know that $f_{x_1, \ldots, x_{k-1}}$ is Lipschitz with constant 
$s_k$, where 

$$s_k := \frac{1}{T} \sum_{j=0}^{T-k} (1 - \kappa)^j \leq \frac{1}{\kappa T}.$$

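Hence we can use the variance bound (11) with $N = 1$ for the function $f_{x_1, \ldots, x_{k-1}}$, 
successively for $k = T, T-1, \ldots, 2$, to obtain: 

$$
\begin{aligned}
E_x\big[\hat\pi(f)^2\big]
&= \int_{\mathcal{X}^T} f_{x_1, \ldots, x_{T-1}}(x_T)^2\, P_{x_{T-1}}(dx_T) \cdots P_{x_1}(dx_2)\, P_x^{T_0+1}(dx_1) \\
&\leq \int_{\mathcal{X}^{T-1}} f_{x_1, \ldots, x_{T-2}}(x_{T-1})^2\, P_{x_{T-2}}(dx_{T-1}) \cdots P_{x_1}(dx_2)\, P_x^{T_0+1}(dx_1)
   + \frac{1}{\kappa^2 T^2}\, P^{T_0+T-1}\!\left(\frac{\sigma^2}{n}\right)(x) \\
&\leq \int_{\mathcal{X}^{T-2}} f_{x_1, \ldots, x_{T-3}}(x_{T-2})^2\, P_{x_{T-3}}(dx_{T-2}) \cdots P_{x_1}(dx_2)\, P_x^{T_0+1}(dx_1)
   + \frac{1}{\kappa^2 T^2} \sum_{k=T_0+T-2}^{T_0+T-1} P^k\!\left(\frac{\sigma^2}{n}\right)(x) \\
&\leq \cdots \\
&\leq \int_{\mathcal{X}} f_\emptyset(x_1)^2\, P_x^{T_0+1}(dx_1)
   + \frac{1}{\kappa^2 T^2} \sum_{k=T_0+1}^{T_0+T-1} P^k\!\left(\frac{\sigma^2}{n}\right)(x) \\
&\leq \big(E_x \hat\pi(f)\big)^2
   + \frac{1}{\kappa^2 T^2} \sum_{k=0}^{T_0} (1 - \kappa)^{2(T_0 - k)}\, P^k\!\left(\frac{\sigma^2}{n}\right)(x)
   + \frac{1}{\kappa^2 T^2} \sum_{k=T_0+1}^{T_0+T-1} P^k\!\left(\frac{\sigma^2}{n}\right)(x),
\end{aligned}
$$

where in the last step we applied the variance inequality (11) to the Lipschitz 
function $f_\emptyset$, with $N = T_0 + 1$. Therefore we get 

$$\mathrm{Var}_x\, \hat\pi(f) \leq \frac{1}{\kappa^2 T^2} \left\{ \sum_{k=0}^{T_0} (1 - \kappa)^{2(T_0 - k)}\, P^k\!\left(\frac{\sigma^2}{n}\right)(x) + \sum_{k=T_0+1}^{T_0+T-1} P^k\!\left(\frac{\sigma^2}{n}\right)(x) \right\}.$$
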

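Theorem 2 is a straightforward consequence of the latter inequality. To establish 
Theorem 3, for instance in the case $T_0 \neq 0$, we use $\sigma^2/n \leq \kappa S$ together with the bound 
$P^k S(x) \leq E_\pi S + C (1 - \kappa)^k E(x)$, which follows from the invariance of $\pi$ since $P^k S$ is 
$C(1 - \kappa)^k$-Lipschitz, and rewrite the above as: 

$$
\begin{aligned}
\mathrm{Var}_x\, \hat\pi(f)
&\leq \frac{1}{\kappa T^2} \left\{ \sum_{k=0}^{T_0} (1 - \kappa)^{2(T_0 - k)}\, P^k S(x) + \sum_{k=T_0+1}^{T_0+T-1} P^k S(x) \right\} \\
&\leq \frac{1}{\kappa T^2} \left\{ \sum_{k=0}^{T_0} (1 - \kappa)^{2(T_0 - k)} \big( E_\pi S + C (1 - \kappa)^k E(x) \big)
   + \sum_{k=T_0+1}^{T_0+T-1} \big( E_\pi S + C (1 - \kappa)^k E(x) \big) \right\} \\
&\leq \frac{1}{\kappa T^2} \left\{ (T_0 + T)\, E_\pi S + C\, E(x) \left( \sum_{k=0}^{T_0} (1 - \kappa)^{2 T_0 - k} + \sum_{k=T_0+1}^{T_0+T-1} (1 - \kappa)^k \right) \right\} \\
&\leq \frac{1 + \frac{T_0}{T}}{\kappa T} \left\{ E_\pi S + \frac{2 C (1 - \kappa)^{T_0}}{\kappa T}\, E(x) \right\}.
\end{aligned}
$$

Finally, the proof in the case $T_0 = 0$ is very similar and is omitted. 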
3.3 Proof of Theorems 4 and 5 

The proof of the concentration theorems 4 and 5 follows the same lines as that for 
variance above, except that Laplace transform estimates $E e^{\lambda f - \lambda E f}$ now play the 
role of the variance $E[f^2] - (E f)^2$. 

Assume that there exists a Lipschitz function $S : \mathcal{X} \to \mathbb{R}$ with $\|S\|_{Lip} \leq C$ 
such that 

$$\frac{\sigma(x)^2}{n_x \kappa} \leq S(x), \qquad x \in \mathcal{X}.$$

Let us give first a result on the Laplace transform of Lipschitz functions under 
the measure at time $N$. 

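Lemma 10. 

Let $\lambda \in \left(0, \frac{\kappa T}{\max\{4C,\, 6\sigma_\infty\}}\right)$. Then for any $N \in \mathbb{N}^*$ and any $\frac{2}{\kappa T}$-Lipschitz function $f$ 
on $\mathcal{X}$, we have: 

$$P^N(e^{\lambda f}) \leq \exp\left\{ \lambda P^N f + \frac{4 \lambda^2}{\kappa T^2} \sum_{k=0}^{N-1} P^k S \right\}. \qquad (13)$$

In the case $C = 0$, the same formula holds for any $\lambda \in \left(0, \frac{\kappa T}{6 \sigma_\infty}\right)$. 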
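Proof. 

Let $f$ be $\frac{2}{\kappa T}$-Lipschitz. By Lemma 38 in [Oll09], we know that if $g$ is an $\alpha$-Lipschitz 
function with $\alpha \leq 1$ and if $\lambda \in \left(0, \frac{1}{3\sigma_\infty}\right)$ then we have the estimate 

$$P(e^{\lambda g}) \leq \exp\left\{ \lambda P g + \kappa \lambda^2 \alpha^2 S \right\},$$

and by rescaling, the same holds for the function $f$ with $\alpha = \frac{2}{\kappa T}$ whenever 
$\lambda \in \left(0, \frac{\kappa T}{6\sigma_\infty}\right)$. Moreover the function $P^N f + \frac{4\lambda}{\kappa T^2} \sum_{k=0}^{N-1} P^k S$ is also $\frac{2}{\kappa T}$-Lipschitz for 
any $N \in \mathbb{N}^*$, since $\lambda \in \left(0, \frac{\kappa T}{4C}\right)$. Hence the result follows by a simple induction 
argument. □ 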
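
The Lipschitz claim used in this proof can be checked directly. The computation below is a sketch assuming the function in question is $P^N f + \frac{4\lambda}{\kappa T^2}\sum_{k=0}^{N-1} P^k S$, and it uses only the fact, recalled in Section 3.1, that $P^k g$ is $(1-\kappa)^k \|g\|_{Lip}$-Lipschitz.

```latex
% Lipschitz constant of P^N f + (4 lambda / (kappa T^2)) sum_{k<N} P^k S,
% with f (2 / (kappa T))-Lipschitz and S C-Lipschitz:
\Big\| P^N f + \frac{4\lambda}{\kappa T^2} \sum_{k=0}^{N-1} P^k S \Big\|_{Lip}
  \leq (1 - \kappa)^N \frac{2}{\kappa T}
     + \frac{4 \lambda C}{\kappa T^2} \sum_{k=0}^{N-1} (1 - \kappa)^k
  = (1 - \kappa)^N \frac{2}{\kappa T}
     + \frac{4 \lambda C}{\kappa^2 T^2} \big( 1 - (1 - \kappa)^N \big) .

% For lambda <= kappa T / (4C) the second term is at most (1 - (1-kappa)^N) / (kappa T), hence
\Big\| P^N f + \frac{4\lambda}{\kappa T^2} \sum_{k=0}^{N-1} P^k S \Big\|_{Lip}
  \leq \frac{1 + (1 - \kappa)^N}{\kappa T}
  \leq \frac{2}{\kappa T} .
```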

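Now let us prove Theorem 5, using again the notation of Section 3.2 above. 
Theorem 4 easily follows from Theorem 5 by taking $S := \sup_{x \in \mathcal{X}} \frac{\sigma(x)^2}{n_x \kappa}$ and letting 
$C \to 0$ in the formula (9). 

Let $f$ be a 1-Lipschitz function on $\mathcal{X}$ and let $\lambda \in \left(0, \frac{\kappa T}{\max\{4C,\, 6\sigma_\infty\}}\right)$. Using the 
Laplace transform estimate (13) with $N = 1$ for the $\frac{2}{\kappa T}$-Lipschitz functions 

$$x_k \mapsto f_{x_1, \ldots, x_{k-1}}(x_k) + \frac{4\lambda}{\kappa T^2} \sum_{l=0}^{T-k-1} P^l S(x_k)$$

successively for $k = T-1, T-2, \ldots, 2$, we have: 

$$
\begin{aligned}
E_x e^{\lambda \hat\pi(f)}
&= \int_{\mathcal{X}^T} e^{\lambda f_{x_1, \ldots, x_{T-1}}(x_T)}\, P_{x_{T-1}}(dx_T) \cdots P_{x_1}(dx_2)\, P_x^{T_0+1}(dx_1) \\
&\leq \int_{\mathcal{X}^{T-1}} e^{\lambda f_{x_1, \ldots, x_{T-2}}(x_{T-1}) + \frac{4\lambda^2}{\kappa T^2} S(x_{T-1})}\, P_{x_{T-2}}(dx_{T-1}) \cdots P_{x_1}(dx_2)\, P_x^{T_0+1}(dx_1) \\
&\leq \int_{\mathcal{X}^{T-2}} e^{\lambda f_{x_1, \ldots, x_{T-3}}(x_{T-2}) + \frac{4\lambda^2}{\kappa T^2} \sum_{l=0}^{1} P^l S(x_{T-2})}\, P_{x_{T-3}}(dx_{T-2}) \cdots P_{x_1}(dx_2)\, P_x^{T_0+1}(dx_1) \\
&\leq \cdots \\
&\leq \int_{\mathcal{X}} e^{\lambda f_\emptyset(x_1) + \frac{4\lambda^2}{\kappa T^2} \sum_{l=0}^{T-2} P^l S(x_1)}\, P_x^{T_0+1}(dx_1) \\
&\leq \exp\left\{ \lambda E_x \hat\pi(f) + \frac{4\lambda^2}{\kappa T^2} \sum_{k=0}^{T_0+T-1} P^k S(x) \right\},
\end{aligned}
$$

where in the last line we applied the Laplace transform estimate (13) to the 
Lipschitz function 

$$x_1 \mapsto f_\emptyset(x_1) + \frac{4\lambda}{\kappa T^2} \sum_{l=0}^{T-2} P^l S(x_1)$$

with $N = T_0 + 1$. Therefore, bounding $P^k S(x) \leq E_\pi S + C(1 - \kappa)^k E(x)$ as in the proof of the variance bounds, we get: 

$$E_x e^{\lambda(\hat\pi(f) - E_x \hat\pi(f))} \leq \exp\left\{ 4\lambda^2 \left( \frac{1 + \frac{T_0}{T}}{\kappa T}\, E_\pi S + \frac{C}{\kappa^2 T^2}\, E(x) \right) \right\}.$$

Finally, using Chebychev's inequality and optimizing in $\lambda \in \left(0, \frac{\kappa T}{\max\{4C,\, 6\sigma_\infty\}}\right)$ 
entails the result. This ends the proof. 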
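
The optimization in $\lambda$ mentioned in the last step can be sketched as follows. The notation $V^2$ and $\lambda_{\max}$ is shorthand introduced here, under the assumption that the Laplace transform bound reads $E_x e^{\lambda(\hat\pi(f) - E_x\hat\pi(f))} \leq e^{4\lambda^2 V^2}$ for all $\lambda \in (0, \lambda_{\max})$ with $\lambda_{\max} = \kappa T/\max\{4C, 6\sigma_\infty\}$.

```latex
% Exponential Chebychev inequality, for 0 < lambda < lambda_max:
P_x\big( \hat\pi(f) - E_x \hat\pi(f) > r \big)
  \leq e^{-\lambda r}\, E_x e^{\lambda(\hat\pi(f) - E_x \hat\pi(f))}
  \leq e^{-\lambda r + 4 \lambda^2 V^2} .

% Gaussian regime: the unconstrained minimizer lambda = r / (8 V^2) is admissible
% when r < r_max := 8 V^2 lambda_max, and gives
e^{-\lambda r + 4 \lambda^2 V^2} \Big|_{\lambda = r/(8V^2)} = e^{-r^2/(16 V^2)} .

% Exponential regime: for r >= r_max one has 4 lambda V^2 <= r/2 for every admissible lambda,
% so letting lambda increase to lambda_max,
P_x\big( \hat\pi(f) - E_x \hat\pi(f) > r \big) \leq e^{-\lambda_{\max} r / 2} .

% Applying the same argument to -f and summing gives the factor 2 in the two-sided bound.
```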